Abstract 

Principal component analysis (PCA) is a popular tool for linear
dimensionality reduction and feature extraction. Kernel PCA is the
nonlinear form of PCA, which is promising in exposing the more
complicated correlations between the original high-dimensional
features. In this paper, we first review the basic ideas of PCA and
kernel PCA, and then focus on the reconstruction of pre-images for
kernel PCA. We also introduce how PCA is used in active shape models
(ASMs), and discuss how kernel PCA can be applied to improve
traditional ASMs. We then present experimental results comparing the
performance of kernel PCA and traditional PCA for pattern
classification. We also implement kernel PCA-based ASMs and use them
to construct human face models. 



1. Introduction 

In this section, we briefly review the principal compo- 
nent analysis method and the active shape models. 

1.1. Principal Component Analysis 

Principal component analysis, or PCA, is a very popu- 
lar technique for dimensionality reduction and feature 
extraction. PCA attempts to find a linear subspace 
of lower dimensionality of the original feature space 
in which the new features have the largest variance 
(Bishop, 2006). 

Consider a dataset $\{x_n\}$ where $n = 1, 2, \ldots, N$, and each $x_n$
is a $D$-dimensional vector. Now we want to project the data onto an
$M$-dimensional subspace where $M < D$. We assume the projection is
denoted as $y = Ax$, where $A = [u_1, \ldots, u_M]^T$ and
$u_i^T u_i = 1$ for $i = 1, 2, \ldots, M$. We want to maximize the
variance of $\{y_n\}$, which is the trace of the covariance matrix of
$\{y_n\}$. Thus, we want to find

$$\max \operatorname{tr}(S_y), \qquad (1)$$

where

$$S_y = \frac{1}{N} \sum_{n=1}^{N} (y_n - \bar{y})(y_n - \bar{y})^T, \qquad (2)$$

and

$$\bar{y} = \frac{1}{N} \sum_{n=1}^{N} y_n. \qquad (3)$$

Let $S_x$ be the covariance matrix of $\{x_n\}$. Since
$\operatorname{tr}(S_y) = \operatorname{tr}(A S_x A^T)$, by using
Lagrange multipliers and taking the derivative, we get

$$S_x u_i = \lambda_i u_i, \qquad (4)$$

which means that $u_i$ is an eigenvector of $S_x$. Now $x_n$ can be
represented as

$$x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i. \qquad (5)$$

$x_n$ can also be approximated by

$$\tilde{x}_n = \sum_{i=1}^{M} (x_n^T u_i) u_i, \qquad (6)$$

where $u_i$ is the eigenvector of $S_x$ corresponding to the
$i$th largest eigenvalue. 
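
As a minimal sketch of Equations (4) to (6), the following NumPy code computes the leading eigenvectors of the covariance matrix and uses them to project and approximately reconstruct the data. The function name `pca_fit` and the synthetic data are illustrative assumptions. The sketch also subtracts the sample mean before projecting, a step that Equations (5) and (6) leave implicit.

```python
import numpy as np

def pca_fit(X, M):
    """Return the sample mean and the top-M eigenvectors/eigenvalues of S_x.

    X is an (N, D) array whose rows are the data points x_n.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    S_x = Xc.T @ Xc / X.shape[0]            # D x D covariance with 1/N normalisation
    eigvals, eigvecs = np.linalg.eigh(S_x)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:M]   # indices of the M largest eigenvalues
    return mean, eigvecs[:, order], eigvals[order]

# Hypothetical data: 200 points in 5 dimensions with unequal variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

mean, U, lam = pca_fit(X, M=2)
Y = (X - mean) @ U            # projected features, one row per data point
X_approx = mean + Y @ U.T     # approximation in the spirit of Equation (6)
```

The trace of the covariance of `Y` equals the sum of the retained eigenvalues, which is the quantity maximized in Equation (1).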

1.2. Active Shape Models 

The active shape model, or ASM, is one of the most popular top-down
computer vision approaches. It is designed to discover the hidden
deformation patterns of the shape of an object or region of interest
(ROI), and to locate the object or ROI in new images. ASMs use the
point distribution model (PDM) to describe the shape (Cootes, 1995).
If a shape consists of $n$ points, and the coordinates of the $i$th
point are $(x_i, y_i)$, then the shape can be represented as a vector

$$x = [x_1, y_1, \ldots, x_n, y_n]^T. \qquad (7)$$

To simplify the problem, we now assume that all shapes have already
been aligned. Otherwise, a rotation by $\theta$, a scaling by $s$ and a
translation by $t$ should be applied to $x$. Given $N$ aligned shapes
as training data, the mean shape can be calculated by

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (8)$$

For each shape $x_i$ in the training set, its deviation from the mean
$\bar{x}$ is

$$dx_i = x_i - \bar{x}. \qquad (9)$$

Then the $2n \times 2n$ covariance matrix $S$ can be calculated by

$$S = \frac{1}{N} \sum_{i=1}^{N} dx_i \, dx_i^T. \qquad (10)$$

Now we perform PCA on $S$. Assume that

$$S p_k = \lambda_k p_k, \qquad (11)$$

where $p_k$ is the eigenvector of $S$ corresponding to the $k$th
largest eigenvalue, and

$$p_k^T p_k = 1. \qquad (12)$$

Let $P$ be the matrix of the first $t$ eigenvectors:

$$P = [p_1, p_2, \ldots, p_t]. \qquad (13)$$

Then we can approximate a shape in the training set as

$$\hat{x} = \bar{x} + P b, \qquad (14)$$

where $b = [b_1, b_2, \ldots, b_t]^T$ is the vector of weights for
different PCA features. By varying the parameters $b_k$, we can
generate new examples of the shape. We can also limit $b_k$ to
constrain the deformation patterns of the shape. Typical limits are

$$-3\sqrt{\lambda_k} \le b_k \le 3\sqrt{\lambda_k}, \qquad (15)$$

where $k = 1, 2, \ldots, t$. An important issue of ASMs is using point
distribution models to search for the shape in images. We do not
address this problem in our paper, and focus on the statistical model. 
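
The statistical part of the point distribution model can be sketched in a few lines of NumPy. This is only an illustration of Equations (8) to (15) under stated assumptions: the function names `fit_pdm` and `generate_shape` are hypothetical, the shapes are random numbers standing in for aligned landmark vectors, and weights outside the typical limits are simply clipped.

```python
import numpy as np

def fit_pdm(shapes, t):
    """Fit a point distribution model.

    shapes: (N, 2n) array of aligned shape vectors [x1, y1, ..., xn, yn].
    Returns the mean shape, the first t eigenvectors P, and their eigenvalues.
    """
    x_bar = shapes.mean(axis=0)              # Equation (8)
    dx = shapes - x_bar                      # Equation (9)
    S = dx.T @ dx / shapes.shape[0]          # Equation (10)
    lam, vecs = np.linalg.eigh(S)            # Equation (11), ascending eigenvalues
    order = np.argsort(lam)[::-1][:t]
    return x_bar, vecs[:, order], lam[order]

def generate_shape(x_bar, P, lam, b):
    """Clip b to +/- 3 sqrt(lambda_k) (Equation (15)), then apply Equation (14)."""
    limit = 3.0 * np.sqrt(np.maximum(lam, 0.0))
    b = np.clip(b, -limit, limit)
    return x_bar + P @ b

# Hypothetical training set: 50 shapes with n = 4 points each.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 8))
x_bar, P, lam = fit_pdm(shapes, t=3)

# A weight far outside the limits is clipped to 3 sqrt(lambda_1).
new_shape = generate_shape(x_bar, P, lam, b=np.array([10.0, 0.0, 0.0]))
```

Setting all weights to zero returns the mean shape, and sweeping a single weight across its allowed range traces out one deformation pattern.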

2. Kernel PCA 

Traditional PCA only allows linear dimensionality reduction. However,
if the data has more complicated structures which cannot be well
represented in a linear subspace, traditional PCA is no longer
adequate. Fortunately, kernel PCA allows us to generalize traditional
PCA to nonlinear dimensionality reduction (Schölkopf et al., 1999). 



2.1. Constructing the Kernel Matrix 

Assume we have a nonlinear transformation $\phi(x)$ from the original
$D$-dimensional feature space to an $M$-dimensional feature space,
where usually $M \gg D$. Then each data point $x_n$ is projected to a
point $\phi(x_n)$. We can perform the traditional PCA in the new
feature space, but this might be extremely costly. Thus kernel methods
are used to simplify the computation (Schölkopf et al., 1996). 

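First we assume that the projected new features have zero mean:

$$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) = 0. \qquad (16)$$

The covariance matrix of the projected features is $M \times M$,
calculated by

$$C = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T, \qquad (17)$$

and its eigenvalues and eigenvectors are given by

$$C v_i = \lambda_i v_i, \qquad (18)$$

where $i = 1, 2, \ldots, M$. From (17) and (18), we have

$$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \{\phi(x_n)^T v_i\} = \lambda_i v_i, \qquad (19)$$

which can be written as

$$v_i = \sum_{n=1}^{N} a_{in} \phi(x_n). \qquad (20)$$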



Now by substituting $v_i$ in (19) with (20), we have

$$\frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \sum_{m=1}^{N} a_{im} \phi(x_m) = \lambda_i \sum_{n=1}^{N} a_{in} \phi(x_n). \qquad (21)$$

By defining the kernel function

$$k(x_n, x_m) = \phi(x_n)^T \phi(x_m), \qquad (22)$$

and multiplying both sides of Equation (21) by $\phi(x_l)^T$, we have

$$\frac{1}{N} \sum_{n=1}^{N} k(x_l, x_n) \sum_{m=1}^{N} a_{im} k(x_n, x_m) = \lambda_i \sum_{n=1}^{N} a_{in} k(x_l, x_n), \qquad (23)$$

or the matrix notation

$$K^2 a_i = \lambda_i N K a_i, \qquad (24)$$

where

$$K_{n,m} = k(x_n, x_m), \qquad (25)$$

and $a_i$ is the $N$-dimensional column vector of $a_{in}$. $a_i$ can
be solved by

$$K a_i = \lambda_i N a_i, \qquad (26)$$

and the resulting kernel principal components can be calculated using

$$y_i(x) = \phi(x)^T v_i = \sum_{n=1}^{N} a_{in} k(x, x_n). \qquad (27)$$

If the projected dataset $\{\phi(x_n)\}$ does not have zero mean, we
can use the Gram matrix $\tilde{K}$ to substitute the kernel matrix
$K$. The Gram matrix is given by

$$\tilde{K} = K - 1_N K - K 1_N + 1_N K 1_N, \qquad (28)$$

where $1_N$ is the $N \times N$ matrix with all elements equal to
$1/N$ (Bishop, 2006). 
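
Equation (28) can be checked numerically in the one case where the feature map is known explicitly. The sketch below assumes a linear kernel, for which $\phi(x) = x$, and compares the Gram matrix from (28) with the kernel matrix of data that has been centred by hand. The helper name `center_kernel` and the random data are illustrative assumptions.

```python
import numpy as np

def center_kernel(K):
    """Apply Equation (28): K~ = K - 1_N K - K 1_N + 1_N K 1_N."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)   # N x N matrix with all entries 1/N
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Hypothetical data with a clearly non-zero mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3)) + 5.0

K = X @ X.T                      # linear kernel, so phi(x) = x
K_tilde = center_kernel(K)

Xc = X - X.mean(axis=0)          # explicit centring in feature space
K_explicit = Xc @ Xc.T
```

For this kernel the two matrices agree up to rounding error, which is what (28) is meant to achieve without ever forming $\phi(x_n)$.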

The power of kernel methods is that we do not have to compute
$\phi(x_n)$ explicitly. We can directly construct the kernel matrix
from the training data set $\{x_n\}$ (Weinberger et al., 2004). Two
commonly used kernels are the polynomial kernel

$$k(x, y) = (x^T y)^d, \qquad (29)$$

or

$$k(x, y) = (x^T y + c)^d, \qquad (30)$$

where $c > 0$ is a constant, and the Gaussian kernel

$$k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2) \qquad (31)$$

with parameter $\sigma$. 

The standard steps of kernel PCA dimensionality re- 
duction can be summarized as: 

1. Construct the kernel matrix $K$ from the training data set
$\{x_n\}$ using (25). 

2. Compute the Gram matrix $\tilde{K}$ using (28). 

3. Use (26) to solve for the vectors $a_i$ (substitute $K$ with
$\tilde{K}$). 

4. Compute the kernel principal components $y_i(x)$ using (27). 
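
The four steps above can be sketched as follows for a Gaussian kernel. This is a minimal illustration, not a definitive implementation: the function names and the random data are assumptions, and the eigenvectors are rescaled so that each $v_i$ has unit length, a common convention that the text does not state explicitly.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """Kernel (31) between every row of A and every row of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_pca_fit(X, sigma, n_components):
    N = X.shape[0]
    # Step 1: kernel matrix, Equation (25).
    K = gaussian_kernel_matrix(X, X, sigma)
    # Step 2: Gram matrix, Equation (28).
    one_N = np.full((N, N), 1.0 / N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N
    # Step 3: solve K~ a_i = (lambda_i N) a_i, Equation (26).
    mu, A = np.linalg.eigh(K_tilde)            # mu_i = lambda_i * N, ascending
    order = np.argsort(mu)[::-1][:n_components]
    mu, A = mu[order], A[:, order]
    # Assumed normalisation: v_i^T v_i = mu_i * a_i^T a_i = 1.
    A = A / np.sqrt(mu)
    # Step 4: kernel principal components of the training data, Equation (27).
    Y = K_tilde @ A
    return Y, A, mu

# Hypothetical training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
Y, A, mu = kernel_pca_fit(X, sigma=2.0, n_components=2)
```

Projecting a new point requires the training data as well, because (27) evaluates the kernel between the new point and every $x_n$, and the same centring has to be applied to those kernel values.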

2.2. Reconstructing Pre-Images 

So far, we have discussed how to generate new features $y_i(x)$ using
kernel PCA. This is enough for applications such as feature extraction
and pattern classification. However, for some other applications, we
need to reconstruct the pre-images $\{x_n\}$ from the kernel PCA
features $\{y_n\}$. This is the case in active shape models, where we
not only need to use PCA features to describe the deformation
patterns, but also have to reconstruct the shapes from the PCA
features (Romdhani et al., 1999; Twining & Taylor, 2001). 

In traditional PCA, the pre-image $x_n$ can simply be approximated by
Equation (6). However, this cannot be achieved for kernel PCA (Bakır
et al., 2004). Now we define a projection operator $P_n$ which
projects $\phi(x)$ to its approximation

$$P_n \phi(x) = \sum_{i=1}^{n} y_i(x) v_i, \qquad (32)$$

where $v_i$ is the eigenvector of the matrix $C$, which is defined by
Equation (17). If $n$ is large enough, we have
$P_n \phi(x) \approx \phi(x)$. Since finding the exact pre-image $x$ is
difficult, we instead look for an approximation $z$ such that

$$\phi(z) \approx P_n \phi(x). \qquad (33)$$

This can be done by minimizing

$$\rho(z) = \|\phi(z) - P_n \phi(x)\|^2. \qquad (34)$$

2.3. Pre-Images for Gaussian Kernels 

There are some existing techniques to compute $z$ for specific kernels
(Mika et al., 1999). For a Gaussian kernel
$k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, $z$ should satisfy

$$z = \frac{\sum_{i=1}^{N} \gamma_i \exp(-\|z - x_i\|^2 / 2\sigma^2) \, x_i}{\sum_{i=1}^{N} \gamma_i \exp(-\|z - x_i\|^2 / 2\sigma^2)}, \qquad (35)$$

where

$$\gamma_i = \sum_{k=1}^{n} y_k a_{ki}. \qquad (36)$$

We can compute $z$ iteratively:

$$z_{t+1} = \frac{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / 2\sigma^2) \, x_i}{\sum_{i=1}^{N} \gamma_i \exp(-\|z_t - x_i\|^2 / 2\sigma^2)}. \qquad (37)$$
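
The fixed-point iteration (37) is short enough to sketch directly. The code below assumes that the weights $\gamma_i$ from (36) have already been computed, and it stops early if the denominator vanishes, a case the iteration itself does not handle. The function name `gaussian_preimage`, the tolerance, and the test data are illustrative assumptions.

```python
import numpy as np

def gaussian_preimage(X, gamma, sigma, z0, n_iter=100, tol=1e-10):
    """Iterate Equation (37) starting from z0.

    X: (N, D) training data, gamma: (N,) weights from Equation (36).
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iter):
        # w_i = gamma_i * exp(-||z_t - x_i||^2 / (2 sigma^2))
        sq_dist = np.sum((X - z) ** 2, axis=1)
        w = gamma * np.exp(-sq_dist / (2.0 * sigma ** 2))
        denom = w.sum()
        if abs(denom) < 1e-300:
            break                      # update undefined; keep the current estimate
        z_next = w @ X / denom
        converged = np.linalg.norm(z_next - z) < tol
        z = z_next
        if converged:
            break
    return z

# Sanity check on hypothetical data: if only one gamma_i is non-zero,
# the weighted average in (37) collapses onto that training point.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
gamma = np.zeros(20)
gamma[4] = 1.0
z = gaussian_preimage(X, gamma, sigma=1.0, z0=X.mean(axis=0))
```

Because some $\gamma_i$ may be negative in practice, convergence is not guaranteed and depends on the starting point.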

3. Experiments 

In this section, we show the setup and results of our 
three experiments. The first two experiments are clas- 
sification problems without pre-image reconstruction. 
The third experiment combines active shape models 
with kernel PCA, and involves the pre-image recon- 
struction algorithm. 
3.1. Pattern Classification for Synthetic Data 

Before we work on real data, we would like to gener- 
ate some synthetic datasets and test our algorithm on 
them. In this paper, we use the two-concentric-spheres 
data. 

3.1.1. Data Description 

We assume that we have an equal number of data points uniformly
distributed on two concentric sphere surfaces. If $N$ is the total
number of all data points, then we have $N/2$ class 1 points on a
sphere of radius $r_1$, and $N/2$ class 2 points on a sphere of radius
$r_2$. In the spherical coordinate system, the inclination (polar
angle) $\theta$ is uniformly distributed in $[0, \pi]$ and the azimuth
(azimuthal angle) $\phi$ is uniformly distributed in $[0, 2\pi)$ for
both classes. Our observations of the data points are the $(x, y, z)$
coordinates in the Cartesian coordinate system, and all three
coordinates are perturbed by Gaussian noise of standard deviation
$\sigma_{\text{noise}}$. We set $N = 200$, $r_1 = 10$, $r_2 = 15$,
$\sigma_{\text{noise}} = 0.1$, and give a 3D plot of the data in
Figure 1. 
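
A dataset of this kind can be generated as in the sketch below, which follows the description above literally: both angles are drawn uniformly, converted to Cartesian coordinates, and perturbed by Gaussian noise. The function name `two_concentric_spheres` and the random seed are assumptions made for illustration.

```python
import numpy as np

def two_concentric_spheres(N=200, r1=10.0, r2=15.0, sigma_noise=0.1, seed=0):
    """Return (X, labels) with N/2 points near each of two concentric spheres."""
    rng = np.random.default_rng(seed)
    half = N // 2
    radius = np.concatenate([np.full(half, r1), np.full(N - half, r2)])
    labels = np.concatenate([np.ones(half, dtype=int),
                             np.full(N - half, 2, dtype=int)])
    # rng.uniform samples the half-open interval [low, high); the difference
    # from the closed interval [0, pi] has probability zero.
    theta = rng.uniform(0.0, np.pi, size=N)        # inclination
    phi = rng.uniform(0.0, 2.0 * np.pi, size=N)    # azimuth
    X = np.column_stack([
        radius * np.sin(theta) * np.cos(phi),
        radius * np.sin(theta) * np.sin(phi),
        radius * np.cos(theta),
    ])
    X += rng.normal(scale=sigma_noise, size=X.shape)   # noise on all coordinates
    return X, labels

X, labels = two_concentric_spheres()
```

The labels are returned only for plotting and evaluation; neither PCA nor kernel PCA would use them.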



Figure 1. 3D plot of the two-concentric-spheres synthetic data. 

3.1.2. PCA and Kernel PCA Results 

To visualize our results, we project the original 3-dimensional data
into a 2-dimensional feature space by using traditional PCA and kernel
PCA respectively. For kernel PCA, we use a polynomial kernel with
$d = 5$ and a Gaussian kernel with $\sigma = 20$. The results of
traditional PCA, polynomial kernel PCA and Gaussian kernel PCA are
given in Figure 2, Figure 3, and Figure 4 respectively. 

Figure 2. Traditional PCA results for the two-concentric-spheres
synthetic data. 

Figure 3. Polynomial kernel PCA results for the two-concentric-spheres
synthetic data with $d = 5$. 

We note that here though we mark points in differ- 
ent classes with different colors, we are actually doing 
unsupervised learning. Neither PCA nor kernel PCA 
takes the class labels as their input. 

In the results we can see that traditional PCA does not reveal any
structural information of the original data. For polynomial kernel
PCA, in the new feature space, class 1 data points are clustered while
class 2 data points are scattered. But they are still not linearly
separable. For Gaussian kernel PCA, the two classes are completely
linearly separable, and both features can reveal the radius
information of the original data. 

Figure 4. Gaussian kernel PCA results for the two-concentric-spheres
synthetic data with $\sigma = 20$. 

3.2. Classification for Aligned Human Face 
Images 

After we have tested our algorithm on synthetic data, 
we would like to use it for real data classification. Here 
we use PCA and kernel PCA to extract features from 
human face images, and use the simplest linear classi- 
fier for classification. Then we compare the error rates 
of using PCA and kernel PCA. 

3.2.1. Data Description 

For this task, we use images from the Yale Face 
Database B (Georghiades et al., 2001), which contains 
5760 single light source gray-level images of 10 sub- 
jects, each seen under 576 viewing conditions. We 
take 51 images of the first and third subject respec- 
tively as the training data, and 13 images of each of 
them as testing data. Then all the images are aligned, 
and each has 168 x 192 pixels. Sample images of the 
Yale Face Database B are shown in Figure 5. 

3.2.2. Classification Results 

We use the 168 x 192 pixel intensities as the original features for
each image, thus the original feature is 32256-dimensional. Then we
use PCA and kernel PCA to extract the 10 most significant features
from the training data, and record the eigenvectors. 

Figure 5. Sample images from the Yale Face Database B. 

Table 1. Classification error rates on training data and testing data
for traditional PCA and Gaussian kernel PCA with $\sigma = 45675$. 

Error Rate       PCA       Kernel PCA
Training Data    6.86%     5.88%
Testing Data     19.23%    11.54%

For traditional PCA, only the eigenvectors are needed 
to extract features from testing data. For kernel PCA, 
both the eigenvectors and the training data are needed 
to extract features from testing data. Note that for 
traditional PCA, there are particular fast algorithms 
to compute the eigenvectors when the dimensionality 
is much larger than the number of data points (Bishop, 
2006). 
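
One such fast approach, sketched here under the assumption that it is the kind of method meant above, works with the $N \times N$ matrix of inner products between centred data points instead of the $D \times D$ covariance matrix. If $v$ is a unit eigenvector of that small matrix with eigenvalue $\lambda$, then the centred data matrix transposed and applied to $v$, divided by $\sqrt{N \lambda}$, is a unit eigenvector of $S_x$ with the same eigenvalue. The function name `pca_small_sample` and the random data are illustrative.

```python
import numpy as np

def pca_small_sample(X, M):
    """Top-M eigenvectors of the D x D covariance via an N x N eigenproblem.

    Intended for N << D. X is (N, D); M must not exceed the rank of the
    centred data, so that the selected eigenvalues are strictly positive.
    """
    N = X.shape[0]
    mean = X.mean(axis=0)
    Xc = X - mean
    G = Xc @ Xc.T / N                      # N x N instead of D x D
    lam, V = np.linalg.eigh(G)             # ascending eigenvalues
    order = np.argsort(lam)[::-1][:M]
    lam, V = lam[order], V[:, order]
    U = Xc.T @ V / np.sqrt(N * lam)        # unit-norm eigenvectors of S_x
    return mean, U, lam

# Hypothetical data: 10 samples in 500 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 500))
mean, U, lam = pca_small_sample(X, M=3)
```

For the face images in this experiment, with 102 training images of 32256 pixels each, this replaces a 32256 x 32256 eigenproblem by a 102 x 102 one.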

For kernel PCA, we use a Gaussian kernel with $\sigma = 45675$ (we
discuss how to select the parameters in Section 4). For
classification, we use the simplest linear classifier. The training
error rates and the testing error rates for traditional PCA and kernel
PCA are given in Table 1. We can see that Gaussian kernel PCA achieves
lower error rates than traditional PCA on both sets, most notably on
the testing data (11.54% versus 19.23%). 

3.3. Kernel PCA-Based Active Shape Models 

In ASMs, the shape of an object is described with point distribution
models, and traditional PCA is used to extract the principal
deformation patterns from the shape vectors $\{x_i\}$. If we use
kernel PCA instead of traditional PCA here, we may be able to discover
more hidden deformation patterns. 

3.3.1. Data Description 



In our work, we use Tim Cootes' manually annotated 
points of 1521 human face images from the BioID 
database. For each face image, 20 feature points (land- 
marks) are labelled, as shown in Figure 6. Thus the 
original feature vector for each image is 40-dimensional 
(two coordinates for each landmark). 

Figure 6. The 20 manually labelled points for an image
(286 x 384) from BioID. 



3.3.2. Experiment Results 

In our work, we first normalize all the shape vectors by restricting
both the x coordinates and y coordinates to the range [0, 1]. Then we
perform PCA and Gaussian kernel PCA on the normalized shape vectors.
For traditional PCA, the reconstruction of the shape is given by

$$\hat{x} = \bar{x} + P b. \qquad (38)$$

For kernel PCA, the reconstruction of the shape is given by

$$z = r(y), \qquad (39)$$

where $r(y)$ denotes the reconstruction algorithm defined by (37). 

For traditional PCA, we focus on studying the deformation pattern
associated with each component of $b$. That is to say, each time we
uniformly select different values of $b_k$ in
$[-3\sqrt{\lambda_k}, 3\sqrt{\lambda_k}]$, and set $b_{k'} = 0$ for all
$k' \neq k$. The effects of varying the first PCA feature and the
second PCA feature are shown in Figure 7 and Figure 8 respectively.
The face is drawn by using line segments to represent the eyebrows,
eyes, and nose, using circles to represent the eyeballs, using a
quadrilateral to represent the mouth, and fitting a parabola to
represent the contour of the face. 

Figure 7. The effect of varying the first PCA feature for ASM. 

For Gaussian kernel PCA, we set $\sigma = 0.7905$ for the Gaussian
kernel (the parameter selection method is given in Section 4). To
study the effects of each feature extracted with kernel PCA, we
compute the mean $\bar{y}_k$ and the standard deviation
$\sigma_{y_k}$ of all the kernel PCA features. Each time, we uniformly
select $y_k$ in
$[\bar{y}_k - c\sigma_{y_k}, \bar{y}_k + c\sigma_{y_k}]$ where $c > 0$
is a constant, and set $y_{k'} = \bar{y}_{k'}$ for all $k' \neq k$.
The effects of varying the first Gaussian kernel PCA feature and the
second Gaussian kernel PCA feature are shown in Figure 9 and Figure 10
respectively. 

By observation, we can see that the first PCA feature 
affects the orientation of the human face, and the sec- 
ond PCA feature to some extent determines some mi- 
croexpression from amazement to calmness of the hu- 
man face. In contrast, the first Gaussian kernel PCA 
feature seems to be determining some microexpression 
from confidence to fear, while the second Gaussian ker- 
nel PCA feature contains both orientation information 
and some microexpression. 

4. Discussion 

In this section, we address two concerns: first, how to 
select the parameters for Gaussian kernel PCA; sec- 
ond, what is the intuitive explanation of Gaussian ker- 
nel PCA. 

4.1. Parameter Selection 

Parameter selection for kernel PCA directly determines the performance
of the algorithm. For Gaussian kernel PCA, the most important
parameter is the $\sigma$ in the kernel function defined by (31). The
Gaussian kernel relies on the distance $\|x - y\|$ between two
vectors. If the distance between these two vectors is much larger than
$\sigma$, the value of $k(x, y)$ will be close to zero. Thus, $\sigma$
is used to enhance the capture range of the kernel function. If
$\sigma$ is too small, all kernel values will be close to zero, and
kernel PCA will fail to extract any information about the structure of
the data. We hope that for a given testing data point $y$ and the
training data set $\{x_n\}$, the kernel function gives at least
several values of $k(x_n, y)$ that are significantly larger than zero.
Thus we are interested in $\min_n \|y - x_n\|$. However, the parameter
must be determined in training, thus we define the median minimal
distance value MedMin of the training data by

$$\text{MedMin} = \operatorname*{median}_{i} \, \min_{j \neq i} \|x_i - x_j\|. \qquad (40)$$

Figure 8. The effect of varying the second PCA feature for ASM. 

Figure 9. The effect of varying the first Gaussian kernel PCA feature
for ASM. 



We use the median value instead of the maximal or mean value here to
exclude some noisy data points that are far away from other data
points. Thus for the parameter $\sigma$, we require it to be
significantly larger than MedMin. In our work, we select $\sigma$ by
setting

$$\sigma = 10 \times \text{MedMin}, \qquad (41)$$

which turns out to perform well for our synthetic data classification
task and human face images classification task. 
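
This selection rule is straightforward to compute from pairwise distances. The sketch below assumes that the minimum in (40) is taken over all other training points, that is, each point's nearest-neighbour distance, and then applies (41). The function name `medmin_sigma` and the toy data are illustrative assumptions.

```python
import numpy as np

def medmin_sigma(X, factor=10.0):
    """Return (sigma, MedMin) for training data X of shape (N, D)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    D = np.sqrt(np.maximum(D2, 0.0))
    np.fill_diagonal(D, np.inf)                      # ignore the distance to itself
    med_min = np.median(D.min(axis=1))               # Equation (40)
    return factor * med_min, med_min                 # Equation (41)

# Toy example: four points on a line at 0, 1, 3 and 7.
# Their nearest-neighbour distances are 1, 1, 2 and 4, so MedMin is 1.5.
X = np.array([[0.0], [1.0], [3.0], [7.0]])
sigma, med_min = medmin_sigma(X)
```

Using the median keeps a single isolated point, such as the one at 7 in the toy example, from inflating the estimate.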

For the pre-image reconstruction of Gaussian kernel PCA, the initial
guess $z_0$ will determine whether the iterative algorithm (37)
converges. We can simply use the mean of the training data as the
initial guess:

$$z_0 = \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (42)$$



4.2. Intuitive Explanation of Gaussian Kernel 
PCA 

We can see that in our synthetic data classification experiment,
Gaussian kernel PCA with a properly selected parameter $\sigma$ can
perfectly separate the two classes in an unsupervised manner, which is
not possible for traditional PCA. In the human face images
classification experiment, Gaussian kernel PCA has a lower training
error rate and a much lower testing error rate than traditional PCA.
From these two experiments, we can see that Gaussian kernel PCA
reveals more complex hidden structures of the data than traditional
PCA. An intuitive understanding of Gaussian kernel PCA is that it
makes use of the distances between different training data points,
much like k-nearest neighbor or clustering methods. With a
well-selected $\sigma$, Gaussian kernel PCA will have a proper capture
range, which will enhance the connection between the data points that
are close to each other in the original feature space. Then by
applying eigenvector analysis, the eigenvectors will describe the
directions in a high-dimensional space in which the different clusters
of data are scattered to the greatest extent. 

In this paper we mostly use the Gaussian kernel for kernel PCA because
it is intuitive, easy to implement, and allows the pre-images to be
reconstructed. However, there are techniques to find more powerful
kernel matrices by learning (Weinberger et al., 2004; 2005). 
Figure 10. The effect of varying the second Gaussian kernel PCA
feature for ASM. 

5. Conclusion 

In this paper, we discussed the theories of PCA, ker- 
nel PCA and ASMs. Then we focused on the pre- 
image reconstruction for Gaussian kernel PCA, and 
used this technique to design kernel PCA-based ASMs. 
We tested kernel PCA dimensionality reduction on 
synthetic data and human face images, and found 
that Gaussian kernel PCA succeeded in revealing more 
complicated structures of data than traditional PCA 
and achieving much lower classification error rates. 
We also implemented the Gaussian kernel PCA-based 
ASM and tested it on human face images. We found 
that Gaussian kernel PCA-based ASMs are promis- 
ing in providing more deformation patterns than tradi- 
tional ASMs. A potential application is that we could 
combine traditional ASMs and Gaussian kernel PCA- 
based ASMs for microexpression recognition on human 
face images. Besides, we proposed a parameter selec- 
tion method to find the proper parameters for Gaus- 
sian kernel PCA, which works well in our experiments. 